{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comparison with SQL\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "To run this tutorial, we need\n",
    "\n",
    "- A database that is authorized to access, and\n",
    "- A table named `demo` as created with the following SQL command:\n",
    "\n",
    "    ```sql\n",
    "    CREATE TABLE demo AS\n",
    "    SELECT n AS i, n AS j, n AS k\n",
    "    FROM generate_series(0,9) AS n;\n",
    "    ```\n",
    "\n",
    "To create this table, if in a shell environment, [psql](https://www.postgresql.org/docs/current/app-psql.html) can be used.\n",
    "\n",
    "Or, inside a Jupyter Notebook, the SQL command can be executed directly in cells with [ipython-sql](https://pypi.org/project/ipython-sql/), as shown below.\n",
    "\n",
    "First we need to connect to the database (`gpadmin` in our example) specified with the URI using the `%sql` magic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext sql\n",
    "%sql postgresql://localhost/gpadmin"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Authentication methods and credentials can be specified in the URI. Please refer to the [libpq document](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) for detailed usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "vscode": {
     "languageId": "sql"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " * postgresql://localhost/gpadmin\n",
      "Done.\n",
      "10 rows affected.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%sql\n",
    "\n",
    "DROP TABLE IF EXISTS demo;\n",
    "\n",
    "CREATE TABLE demo AS\n",
    "SELECT n AS i, n AS j, n AS k\n",
    "FROM generate_series(0,9) AS n;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the table created successfully, we are now good to go! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting Access to Database\n",
    "\n",
    "To get access to the database we want:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import greenplumpython as gp\n",
    "\n",
    "db = gp.database(uri=\"postgresql://localhost/gpadmin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the `uri` follows the same specification in the [libpq document](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) as above.\n",
    "\n",
    "As another example, if password is required, the `uri` might look like `postgresql://user:password@hostname/dbname`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Accessing a DataFrame in the Database\n",
    "\n",
    "After selecting the database, we can access a dataframe in the database by specifying its name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t\t<th>k</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>3</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>5</td>\n",
       "\t\t<td>5</td>\n",
       "\t\t<td>5</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>9</td>\n",
       "\t\t<td>9</td>\n",
       "\t\t<td>9</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>7</td>\n",
       "\t\t<td>7</td>\n",
       "\t\t<td>7</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>1</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-----------\n",
       " i | j | k \n",
       "---+---+---\n",
       " 2 | 2 | 2 \n",
       " 3 | 3 | 3 \n",
       " 5 | 5 | 5 \n",
       " 9 | 9 | 9 \n",
       " 0 | 0 | 0 \n",
       " 4 | 4 | 4 \n",
       " 6 | 6 | 6 \n",
       " 7 | 7 | 7 \n",
       " 8 | 8 | 8 \n",
       " 1 | 1 | 1 \n",
       "-----------\n",
       "(10 rows)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = db.create_dataframe(table_name=\"demo\")\n",
    "t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And of course, we can `SELECT` the first ordered N rows of a dataframe, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t\t<th>k</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>1</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>3</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>5</td>\n",
       "\t\t<td>5</td>\n",
       "\t\t<td>5</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>7</td>\n",
       "\t\t<td>7</td>\n",
       "\t\t<td>7</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>9</td>\n",
       "\t\t<td>9</td>\n",
       "\t\t<td>9</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-----------\n",
       " i | j | k \n",
       "---+---+---\n",
       " 0 | 0 | 0 \n",
       " 1 | 1 | 1 \n",
       " 2 | 2 | 2 \n",
       " 3 | 3 | 3 \n",
       " 4 | 4 | 4 \n",
       " 5 | 5 | 5 \n",
       " 6 | 6 | 6 \n",
       " 7 | 7 | 7 \n",
       " 8 | 8 | 8 \n",
       " 9 | 9 | 9 \n",
       "-----------\n",
       "(10 rows)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t.order_by(\"i\")[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Data Manipulation\n",
    "\n",
    "Now we have a dataframe. We can do basic data manipulation on it, just like in SQL.\n",
    "\n",
    "For example, we can `SELECT` a subset of its columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>3</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>5</td>\n",
       "\t\t<td>5</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>9</td>\n",
       "\t\t<td>9</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>7</td>\n",
       "\t\t<td>7</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>1</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-------\n",
       " i | j \n",
       "---+---\n",
       " 2 | 2 \n",
       " 3 | 3 \n",
       " 5 | 5 \n",
       " 9 | 9 \n",
       " 0 | 0 \n",
       " 4 | 4 \n",
       " 6 | 6 \n",
       " 7 | 7 \n",
       " 8 | 8 \n",
       " 1 | 1 \n",
       "-------\n",
       "(10 rows)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_ij = t[[\"i\", \"j\"]]\n",
    "t_ij"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can also `SELECT` a subset of its rows. Say we want all the even numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>6</td>\n",
       "\t\t<td>6</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>8</td>\n",
       "\t\t<td>8</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-------\n",
       " i | j \n",
       "---+---\n",
       " 2 | 2 \n",
       " 0 | 0 \n",
       " 4 | 4 \n",
       " 6 | 6 \n",
       " 8 | 8 \n",
       "-------\n",
       "(5 rows)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_even = t_ij[lambda t: t[\"i\"] % 2 == 0]\n",
    "t_even"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a quick glance, we can `SELECT` the first unordered N rows of a dataframe, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-------\n",
       " i | j \n",
       "---+---\n",
       " 2 | 2 \n",
       " 0 | 0 \n",
       " 4 | 4 \n",
       "-------\n",
       "(3 rows)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_n = t_even[:3]\n",
    "t_n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally when we are done, we can save the resulting dataframe to the database, either temporarily or persistently:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>i</th>\n",
       "\t\t<th>j</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-------\n",
       " i | j \n",
       "---+---\n",
       " 0 | 0 \n",
       " 2 | 2 \n",
       " 4 | 4 \n",
       "-------\n",
       "(3 rows)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_n.save_as(table_name=\"t_n\", column_names=[\"i\", \"j\"], temp=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Joining Two DataFrames\n",
    "\n",
    "We can also `JOIN` two dataframes with GreenplumPython. For example, suppose we have two dataframes like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>id</th>\n",
       "\t\t<th>val</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>'c'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>'d'</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "----------\n",
       " id | val \n",
       "----+-----\n",
       "  1 | 'a' \n",
       "  2 | 'b' \n",
       "  3 | 'c' \n",
       "  4 | 'd' \n",
       "----------\n",
       "(4 rows)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows = [\n",
    "    (1, \"'a'\"),\n",
    "    (2, \"'b'\"),\n",
    "    (3, \"'c'\"),\n",
    "    (4, \"'d'\"),\n",
    "]\n",
    "t1 = db.create_dataframe(rows=rows, column_names=[\"id\", \"val\"])\n",
    "t1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>id</th>\n",
       "\t\t<th>val</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "----------\n",
       " id | val \n",
       "----+-----\n",
       "  1 | 'a' \n",
       "  2 | 'b' \n",
       "  3 | 'a' \n",
       "  4 | 'b' \n",
       "----------\n",
       "(4 rows)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows = [\n",
    "    (1, \"'a'\"),\n",
    "    (2, \"'b'\"),\n",
    "    (3, \"'a'\"),\n",
    "    (4, \"'b'\"),\n",
    "]\n",
    "t2 = db.create_dataframe(rows=rows, column_names=[\"id\", \"val\"])\n",
    "t2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can `JOIN` the two dataframe like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>t1_id</th>\n",
       "\t\t<th>t1_val</th>\n",
       "\t\t<th>t2_id</th>\n",
       "\t\t<th>t2_val</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t\t<td>3</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t\t<td>1</td>\n",
       "\t\t<td>'a'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t\t<td>4</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t\t<td>2</td>\n",
       "\t\t<td>'b'</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "---------------------------------\n",
       " t1_id | t1_val | t2_id | t2_val \n",
       "-------+--------+-------+--------\n",
       "     1 | 'a'    |     3 | 'a'    \n",
       "     1 | 'a'    |     1 | 'a'    \n",
       "     2 | 'b'    |     4 | 'b'    \n",
       "     2 | 'b'    |     2 | 'b'    \n",
       "---------------------------------\n",
       "(4 rows)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_join = t1.join(\n",
    "    t2,\n",
    "    on=\"val\",\n",
    "    self_columns={\"id\": \"t1_id\", \"val\": \"t1_val\"},\n",
    "    other_columns={\"id\": \"t2_id\", \"val\": \"t2_val\"},\n",
    ")\n",
    "t_join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating and Calling Functions\n",
    "\n",
    "Calling functions is essential for data analytics. GreenplumPython supports creating Greenplum UDFs and UDAs from Python functions and calling them in Python.\n",
    "\n",
    "Suppose we have a dataframe of numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>val</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>2</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>3</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>5</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>6</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>7</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>8</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>9</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-----\n",
       " val \n",
       "-----\n",
       "   0 \n",
       "   1 \n",
       "   2 \n",
       "   3 \n",
       "   4 \n",
       "   5 \n",
       "   6 \n",
       "   7 \n",
       "   8 \n",
       "   9 \n",
       "-----\n",
       "(10 rows)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows = [(i,) for i in range(10)]\n",
    "numbers = db.create_dataframe(rows=rows, column_names=[\"val\"])\n",
    "numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to get the square of each number, we can write a function to do that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>square</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>0</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>1</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>4</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>9</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>16</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>25</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>36</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>49</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>64</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>81</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "--------\n",
       " square \n",
       "--------\n",
       "      0 \n",
       "      1 \n",
       "      4 \n",
       "      9 \n",
       "     16 \n",
       "     25 \n",
       "     36 \n",
       "     49 \n",
       "     64 \n",
       "     81 \n",
       "--------\n",
       "(10 rows)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@gp.create_function\n",
    "def square(a: int) -> int:\n",
    "    return a**2\n",
    "\n",
    "\n",
    "numbers.apply(lambda t: square(t[\"val\"]), column_name=\"square\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this function is called in exactly the same way as ordinary Python functions.\n",
    "\n",
    "If we also want to get the sum of these numbers, what we need is to write an aggregate function like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>sum</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>45</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "-----\n",
       " sum \n",
       "-----\n",
       "  45 \n",
       "-----\n",
       "(1 row)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@gp.create_aggregate\n",
    "def my_sum(result: int, val: int) -> int:\n",
    "    if result is None:\n",
    "        return val\n",
    "    return result + val\n",
    "\n",
    "\n",
    "numbers.apply(lambda t: my_sum(t[\"val\"]), column_name=\"sum\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Grouping\n",
    "\n",
    "We can use `DataFrame.group_by()` to divide a dataframe into groups by distinct values of columns. \n",
    "\n",
    "Afterwards, we can apply aggregate functions to individual groups to obtain group-wise summary.\n",
    "\n",
    "For example, we can divide numbers into two group based on whether each number is even, and sum each of the groups:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "\t<tr>\n",
       "\t\t<th>sum</th>\n",
       "\t\t<th>is_even</th>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>20</td>\n",
       "\t\t<td>True</td>\n",
       "\t</tr>\n",
       "\t<tr>\n",
       "\t\t<td>25</td>\n",
       "\t\t<td>False</td>\n",
       "\t</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "---------------\n",
       " sum | is_even \n",
       "-----+---------\n",
       "  20 |       1 \n",
       "  25 |       0 \n",
       "---------------\n",
       "(2 rows)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import greenplumpython.builtins.functions as F\n",
    "\n",
    "(\n",
    "    numbers.assign(is_even=lambda t: t[\"val\"] % 2 == 0)\n",
    "    .group_by(\"is_even\")\n",
    "    .apply(lambda t: F.sum(t[\"val\"]))\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "ipynb,md"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
